Good morning, everybody. How are you guys awake? Or sober?
Who slept in this room last night? That's the only reason you're here.
One guy. Okay. So this is GitDigger: creating useful wordlists from GitHub. I'm Wick.
I'm Mubix. Go ahead. Explain it.
So last night at random ‑‑ well, not random for Mubix, but we ran into a taxi line,
decided to go with them over to Pawn Stars. Everybody know Pawn Stars?
So inside we're walking around, we're looking at the souvenirs, and all of a sudden we notice
this kiosk. Everybody's using it. What's that? Well, we walk up to it, and it has a camera.
You can take a picture of yourself, and they allow you to log in with your user name and
password to Facebook or Twitter, to send the image to yourself or tweet it out to the public.
So I e‑mail it to myself. I'm not giving them anything.
And this is the result on the screen. Legit, right?
So I did most of the research. I did all the research.
That's me. Yeah, that's him.
Here. So we're not the first ones to make word lists. In 2009 and 2012, Sebastian Riv,
French something, made word lists from Wikipedia. He's an awesome guy. I'm not trying to make
fun of him. But also all of Matt Weir's stuff, if you haven't used Matt Weir's keyboard dictionary,
it's one of the best ones to find people who just used, you know, along the way.
And the other people who make awesome word lists are Rakiu. Going on.
Okay.
Go ahead.
So we weren't the first to go digging through source code repositories. Mavituna Security
released SVN Digger, where they went through a ton of SVN repositories, looped
through and then published the frequency count of all the files and all the directories
that they found and pulled down from ‑‑ I forget exactly where they pulled them down
from.
Google Code.
Google Code.
So just to point out really quick in the slide, if you take a picture of that QR code, we're
not trying to hack you. It's actually linked to the information. So it's ‑‑
I made them, not him. So they're good to go.
The only problem with using Google Code and stuff like that is they like to put these
CAPTCHAs in, which make it really hard to automate stuff.
So this is how everything got started.
2 o'clock in the morning. Somebody posts a link to SVN Digger. Everybody thinks it's
cool. I haven't seen anything like it before then. Rob's like, that's awesome. And that
one line, that's the only reason he's standing up here right now, because of that one line
of code.
One line.
So I'm like, oh, this is awesome. I can do this crap in 30 minutes or so. I'll go to
bed, wake up in the morning, code will be done. And ‑‑
I'll have an awesome word list.
So my first problem was that I couldn't find ‑‑ at 2 in the morning, mind you ‑‑ a good
way to get all the repositories. So I started with the lists GitHub publishes of the top
repositories, the most forked. I used some basic Python to web scrape all of that,
saving the user names and the project names into SQLite. And then I just set my computer
loose cloning all the repositories. So now what do I do with it? I have these repositories.
I am using os.walk to go through each repository and keep a count of the file
names and the directory names. I'm doing a whole lot of sed, grep, awk, just trying to
clean everything up, make it nice and easy. There was a ton of manual review, because
I thought it would be easy to go through and pull out all the user names and passwords
and e‑mail addresses I found in this code. So I spent about 17 hours total on my 30‑minute
project, all kinds of hours trying to pull out user names and passwords, and I got
a mile-long line of stuff.
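
The counting pass described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's script: the function name `count_names` is hypothetical, and skipping the `.git` metadata directory is an assumption made here so that Git internals do not dominate the counts.

```python
import os
from collections import Counter


def count_names(root):
    """Walk one cloned repository and tally file and directory names."""
    files, dirs = Counter(), Counter()
    for _path, dirnames, filenames in os.walk(root):
        # Assumption: prune Git's own metadata so it is neither counted
        # nor descended into (os.walk honours in-place edits of dirnames).
        if ".git" in dirnames:
            dirnames.remove(".git")
        dirs.update(dirnames)
        files.update(filenames)
    return files, dirs
```

Calling `files.most_common(20)` on the result gives the start of a frequency-ordered list for that one repository; the counters from many repositories can be added together with `+`.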
It was sed that I just copy and paste and come back to later. So os.walk was taking forever
to go through and find everything. And I'm like, there's got to be a better way to do
this. So after some Google-fu, I found betterwalk, which claims that os.walk makes
unnecessary API calls to go, is this a folder? Is this a file? We don't know. API, please
tell me.
And they cut that out of their loop, which speeds things up by up to two and a half times.
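
The idea behind that speedup (reuse the file-type information the directory listing already returns instead of asking the OS again for every entry) is available in current Python as `os.scandir`. The sketch below uses `os.scandir` rather than betterwalk itself, so it illustrates the technique and is not the code used in the talk; `scan_counts` is a hypothetical name.

```python
import os
from collections import Counter


def scan_counts(root):
    """Tally file and directory names under root using os.scandir."""
    files, dirs = Counter(), Counter()
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # is_dir() normally reuses the type reported by the directory
                # listing, so no extra per-entry stat call is needed.
                if entry.is_dir(follow_symlinks=False):
                    dirs[entry.name] += 1
                    stack.append(entry.path)
                else:
                    files[entry.name] += 1
    return files, dirs
```

An explicit stack is used instead of recursion so that very deep repository trees do not hit the interpreter's recursion limit.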
So the good news is I've got some awesome word lists. And I posted them out on IRC.
Everybody loved them. I was like, great. But the bad news is I've only got some repositories.
I've got maybe the most popular repositories. And that's it.
SQL transactions.
Right.
They were extremely slow. It took maybe about 30 seconds to go, is this already in my table?
Yes? Okay. Let's add one to the count. And the 17 hours of manual labor really sucked
because I am the laziest bastard on the planet. If I could have got my goon to carry me in
here, I would have. And my hard drive was full. I've had terabytes of this data. So
everybody liked it. So I'm like, okay, let's get a little serious. How can I make this
better? How can I streamline it? How can I not do 17 hours of manual labor?
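
The "is this already in my table? Then add one to the count" pattern can be sketched in SQLite as below. Two things are assumptions here rather than anything stated in the talk: the table and column names (`names`, `name`, `seen`) are hypothetical, and the combination of a primary key (which gives the lookup an index) with one transaction per batch is a common way to speed this pattern up, not a verified fix for the 30-second lookups described.

```python
import sqlite3
from collections import Counter


def add_counts(conn, names):
    """Add a batch of observed names to the running frequency table."""
    batch = Counter(names)  # collapse duplicates in memory first
    with conn:  # one transaction (and one commit) for the whole batch
        conn.executemany(
            "INSERT OR IGNORE INTO names (name, seen) VALUES (?, 0)",
            ((name,) for name in batch),
        )
        conn.executemany(
            "UPDATE names SET seen = seen + ? WHERE name = ?",
            ((count, name) for name, count in batch.items()),
        )


conn = sqlite3.connect(":memory:")
# PRIMARY KEY creates an index, so each existence check is an index lookup.
conn.execute("CREATE TABLE names (name TEXT PRIMARY KEY, seen INTEGER NOT NULL)")
add_counts(conn, ["index.php", "README", "index.php"])
add_counts(conn, ["index.php"])
```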
First problem, storage. How am I going to store all the data? So my first thought, did
some Googling, and BitCasa, awesome. $99 a year, unlimited space, built-in indexing so
I could give other people access to all the code and they could search for whatever in
the world they want. And it worked. At that time, six months ago, there was only
a Windows client. It crashed every time I tried to launch a Robocopy or just a simple copy
and paste. And it was extremely slow because they encrypted all the data on the way up.
So what might have taken me six days to upload a terabyte with my slow ass connection would
have taken like a month. The next option, which I thought was the option, was to have
a NAS. Everything would be stored in one place. It would be protected. I could download
directly to it. But it's hard to get free money for these things. So I had three terabytes
already. So my solution right there is the first ten terabytes of all the data.
That's awesome. So the next problem is how can I make downloading
these repositories better, easier? How can I get all of the repositories? So when I was
actually awake, I found the API, which I felt incredibly stupid not knowing about.
And it's nice because the API gives you all kinds of nice
useful information. The only thing that I haven't found is they'll tell you that it's
a fork of a project, but they don't tell you who was the main project, who it was forked
from. So I can keep track of how popular a project is, but I have no idea which guy
was the original. So database. SQLite sucks really bad when you're trying to
store a lot of data. So I switched to MySQL. I've had questions in the past, why didn't
I use Postgres? Well, I know MySQL. And again, I'm lazy. Didn't want to learn something new.
So let's put this all together now. So now I've got two main scripts. I've got the first
Python script that is threaded, goes through, downloads all the data. It's got another mode
that will go through and process all that data. And then I have another script, which
I'll talk a little bit more about, that just takes a long list of user names, passwords,
e‑mail addresses, and I pass it a table name and it just goes and dumps all the data
into that table. The MySQL database, that is what I upgraded the most. I actually created
a table to keep track of more project information, and the user names
and passwords and everything now each have their own table. And I'm keeping track of the
last seen ID so that I don't have to start over or repeat myself.
So here's how the downloading works. The downloader goes out to the API and says, give me
100 repositories, I've already seen 5,000. So GitHub comes back at you and says, okay,
here's the next 100. So it downloads that, dumps it into the database that I've got, and
then automatically clones the repository to my hard drive. Unfortunately, the processing
got a little better, but there's still a lot of manual work. So now the processor mode is
checking my database going, okay, I don't have this repository, but I know it exists.
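
The "give me the next batch after the last ID I saw" loop can be sketched like this. It is an illustration under assumptions, not the speaker's downloader: it relies on GitHub's public repository listing endpoint and its `since` parameter, reads only the `id` and `full_name` fields, builds the clone URL from `full_name`, and the names `fetch_page` and `crawl` are hypothetical. Authentication, rate limiting and the actual `git clone` step are left out.

```python
import json
import urllib.request

API_URL = "https://api.github.com/repositories?since={since}"


def fetch_page(since):
    """Fetch the next page of public repositories with an ID above `since`."""
    req = urllib.request.Request(
        API_URL.format(since=since),
        headers={"User-Agent": "wordlist-sketch"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def crawl(last_seen_id, max_pages, fetch=fetch_page):
    """Yield (id, full_name, clone_url), resuming after last_seen_id."""
    for _ in range(max_pages):
        page = fetch(last_seen_id)
        if not page:
            break  # nothing newer than the last seen ID
        for repo in page:
            clone_url = "https://github.com/" + repo["full_name"] + ".git"
            yield repo["id"], repo["full_name"], clone_url
            # Persisting this value is what lets a later run pick up here.
            last_seen_id = repo["id"]
```

Passing the fetch function in as a parameter keeps the paging logic testable without touching the network.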
It downloads it, great. Or it goes through and automatically loops over it, runs
betterwalk on it. And now if you notice the red line, that's all my manual work. So I have
to grep all this data, pull out usernames, passwords, e-mails, RSA keys, all kinds of
fun stuff. And then clean it up, which can take forever. One
grep session of one day can take four days for me to go through and clean it all up and
dump it into the database. And then I have a bash script that will just go connect to
the database, dump everything, and create the word list and automatically send it back
up to GitHub, which is a real irony. I'm downloading all their data and yet storing it on GitHub.
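
The manual grep pass mentioned above (usernames, passwords, e-mails, RSA keys) could be partly approximated with a few regular expressions. The patterns below are illustrative assumptions, not the expressions used in the project, and they will produce both false positives and misses, which is consistent with the cleanup work described.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |DSA |EC )?PRIVATE KEY-----"),
    "password": re.compile(r"""(?i)pass(?:word|wd)?\s*[:=]\s*['"]([^'"\s]+)['"]"""),
}


def scan_text(text):
    """Return (kind, line_number, value) for every pattern hit in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                # Use the captured value when the pattern has a group,
                # otherwise the whole match.
                value = match.group(1) if pattern.groups else match.group(0)
                hits.append((kind, lineno, value))
    return hits
```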
So the updated news, I now have all the repositories. I can now get every single public one.
Generating the word list with the bash script takes minutes once everything is in the database.
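
Once the counts are in a database, producing a word list is essentially one ordered query. The talk does this with a bash script against MySQL; the sketch below shows the same step in Python with `sqlite3` as a stand-in, and the table names, the `name` and `seen` columns and the function `write_wordlist` are all hypothetical.

```python
import sqlite3

# Assumed table names; identifiers cannot be bound as SQL parameters,
# so only a fixed set is accepted.
KNOWN_TABLES = {"files", "directories", "usernames", "passwords"}


def write_wordlist(conn, table, out_path, min_count=1):
    """Write names from `table` to out_path, most frequent first."""
    if table not in KNOWN_TABLES:
        raise ValueError("unexpected table name: " + table)
    rows = conn.execute(
        f"SELECT name FROM {table} WHERE seen >= ? ORDER BY seen DESC, name",
        (min_count,),
    )
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for (name,) in rows:
            out.write(name + "\n")
            written += 1
    return written
```

Ordering by frequency means a tool that reads the list top-down tries the most common names first.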
Because of the updates I did to the database, I can now store the repositories on any hard
drive I want, search the database, and it will tell me which one to go get. The sucky
part about that is,
if I want to go back and grep for more stuff, I have to get this giant hub and plug all
these hard drives in at the same time. It's awesome. You should see it.
Yeah. I'm estimating that it's going to take about 30 terabytes to download all the public
repositories. However, I am pulling that number out of my butt based off of the first ‑‑
the amount of repositories I got from the first 10 terabytes. Because everybody is uploading
new stuff every single day. I could probably continue with this project forever and never
see the end of GitHub. So this is the big data drinking game. If
you just heard me say big data, drink. But you guys are all hungover so I'm not going
to ask you to do it. So obviously this is a buildup to the
actual word list. What did we get out of it? So anyone with kids knows exactly how
this goes. So all together. If you don't have kids, you can just
get the movie and fast forward to that part because it is like the best part of the whole
movie. All right. So all directories, all files, usernames, passwords, these are pretty
straightforward lists. But the cool thing about them is what you see inside of them. And you
know, we're not just talking about password lists. Password lists, you know, that's the
obvious use, right? I'm going to have a set of passwords that I'm going to use against
it. The all directories list and all files list is awesome when you're talking about
web application attacks. And the usernames, I didn't know that so many people loved Bob.
But they do. More than admin. So stats. So this is ‑‑
Pretty pictures. Yeah. I promise this is the only thing I'm going to use. I'm going to use this.
I just wanted to give an overview of how many passwords are in the database versus
how many of them are actually unique to each section.
So this is where it gets relevant to what I do. I'm a senior red teamer. And one of
the things that ‑‑ I just break stuff. So I already talked about forced browsing.
SVN Digger kind of started that whole thing. The great thing about forced browsing
is, you know, when you have a lot of data, you don't have to worry about
putting together a set of directories or word lists or stuff like that. Exactly like
with DirBuster, you just go through it and find things. You can actually use these lists
with DirBuster. The small default password list is not exactly what I
would have expected as the default passwords. You start with root, toor, blah. This actually
got a lot of different ones. Static salts. It's hilarious when you find
a repository that has a salt for passwords and then that repository is used as an application
out there in the real world. I actually stopped pulling out static salts
because there are so many. And I went, I'm never going to get this done in time to do
a CFP on the project if all I did was pull out the static salts.
So ‑‑ five minutes. So number 22 on the list of
files is exception.php. I have never, ever looked for that when I was looking at a web
application, even a PHP one. But after Wick had done his research and shared the list,
I used it and got code execution, because exception.php was actually loading the exception
information from a file and you could just specify any file you want. So it's on my list
now. File.php. I'm going to keep going through because there's five minutes left. That's
Burp, forced browsing. You guys all know how to do that.
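
For readers who have not done it, forced browsing with one of these lists boils down to requesting each name against a base URL and keeping the responses that are not "not found". The sketch below is a bare-bones illustration of that loop, not a replacement for DirBuster or Burp: `probe` and `forced_browse` are hypothetical names, the set of "interesting" status codes is an assumption, and the target is assumed to be a system you are authorized to test.

```python
import urllib.error
import urllib.request
from urllib.parse import quote, urljoin


def probe(url):
    """Return the HTTP status for url (default prober; uses the network)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 403, 404 and friends arrive as exceptions


def forced_browse(base_url, words, probe=probe, interesting=(200, 401, 403)):
    """Try each word from the list under base_url and keep notable hits."""
    base = base_url if base_url.endswith("/") else base_url + "/"
    found = []
    for word in words:
        word = word.strip()
        if not word or word.startswith("#"):
            continue  # skip blank lines and comments in the word list
        url = urljoin(base, quote(word.lstrip("/")))
        status = probe(url)
        if status in interesting:
            found.append((status, url))
    return found
```

A 403 is kept on purpose: it usually means the path exists even though it cannot be read.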
SSH1 auth keys we found. It is pretty awesome.
And this is one of my favorites. NTLMSSOMagic. Anyone know what that does? It has your user
name and password statically assigned, so it does NTLM.
All right. So real world stuff. Anyone see this release, this web application code?
The secret tokens for Rails, if you have a secret token stored in your repository and
it's also used in your production without you changing it, it's direct remote code execution.
So this is the gentleman, I'm going to butcher his name ‑‑ I won't butcher his name,
but he sent out an e‑mail, he was nice about it and sent out an e‑mail to all 1,000 users
who had this in their repositories.
I am much too lazy to do any of that.
So the not so obvious stuff is you start parsing every file from the Git revision history.
Right now Wick isn't, but if you store your password and then remove it, like the gentleman
just said, you can still go back in the history if you don't nuke it.
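
Looking through history for secrets that were later removed can be sketched as: dump every patch with `git log -p`, then keep deleted lines that look like credentials. This is an assumption-laden illustration, not part of the project as presented: the regular expression is a rough heuristic, and `removed_secret_lines` and `history_patches` are hypothetical names.

```python
import re
import subprocess

# Rough heuristic for "looks like a credential assignment".
SECRET = re.compile(r"(?i)(?:pass(?:word|wd)?|secret|token)\s*[:=]")


def removed_secret_lines(patch_text):
    """Return deleted lines in a unified diff that look like credentials."""
    hits = []
    for line in patch_text.splitlines():
        # Deleted content starts with a single "-"; "---" is a file header.
        if line.startswith("-") and not line.startswith("---"):
            if SECRET.search(line):
                hits.append(line[1:].strip())
    return hits


def history_patches(repo_dir):
    """Return the full patch history of all branches via the git CLI."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--all", "-p", "--no-color"],
        capture_output=True, text=True, errors="replace", check=True,
    )
    return result.stdout
```

Typical use would be `removed_secret_lines(history_patches("path/to/clone"))` on a locally cloned repository.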
Mass data code analysis, so you can find vulnerabilities in a ton of things really quickly.
.svn and .settings are amazing things for some reason ‑‑ when you
convert an SVN repository into a Git repository, sometimes people forget to delete those things
and they can have configs, including database configs and all kinds of things.
.gitignore is an amazing little file that tells your Git repository what files to never
commit.
Those are exactly the files that I want to look for because those are the things that
are important, so I usually look for that.
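
Turning a `.gitignore` into a list of paths worth requesting can be sketched as below. The function name is hypothetical, and the choice to drop wildcard and negated patterns is a simplification made here; only literal paths are kept.

```python
def candidates_from_gitignore(text):
    """Extract literal paths from .gitignore content, in order, de-duplicated."""
    paths = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments and negated ("!") patterns.
        if not line or line.startswith(("#", "!")):
            continue
        # Simplification: wildcard patterns are ignored rather than expanded.
        if any(ch in line for ch in "*?["):
            continue
        path = line.strip("/")
        if path and path not in paths:
            paths.append(path)
    return paths
```

The resulting paths can be fed straight into a forced-browsing word list for the application built from that repository.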
403s, empty directories and .DS_Store files. Git, as
well as SVN, doesn't let you create a directory and commit it unless there's something
in it.
So an empty placeholder file or a .DS_Store is usually how some people do it.
Another thing is running OCR on all the images.
We actually found a gentleman ‑‑ or a girl ‑‑ that had their
password stored in an image in the repository, it was awesome.
Using a list of text files, grepping out all the e‑mails, which he already does, and
I'm stopping because it gives away all the ideas, and we're done.
Thank you.
I actually want to give a quick thank you to NoVA Hackers.
Are there any NoVA Hackers in the room?
Y'all suck.
They suck.
But without their help and support and encouragement, I would have never kept going with this project
because they help me out with resources.
I now have a file server which can store up to 34 terabytes of data, so once I get the
original 10 terabytes switched over, I'm going to start downloading and pulling out some
more stuff.
Cool stuff.
I know everyone is waiting for the next talk.
Questions?
All right.
Cool.
Thanks.
Thanks for coming.
Did you do indexing on the SQLite database?
No.
No, I did not.
And I probably should have.
I'm not a programmer.
I'm a problem solver.
Okay.
Cool.
Well, I've had one ‑‑ I gave this talk at BSides, and I had one guy come up to me
and mention, well, why didn't you just do it in memory?
I'm like, oh, I have three terabytes worth of data to go through.
I don't think my computer would live to do that and store the database in memory.
I'm afraid.
Maybe.
I don't know.
Thanks a lot, guys.
I appreciate it.
